A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis

Authors

  • Sarah E. Reese
  • Kellie J. Archer
  • Terry M. Therneau
  • Elizabeth J. Atkinson
  • Celine M. Vachon
  • Mariza de Andrade
  • Jean-Pierre A. Kocher
  • Jeanette E. Eckel-Passow
Abstract

MOTIVATION Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields the linear combinations of the variables that contribute maximum variance, so it will not necessarily detect batch effects if they are not the largest source of variability in the data.

RESULTS We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply the proposed test statistic to simulated data and to two copy number variation case studies: the first consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in both case studies.

CONCLUSION We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.

AVAILABILITY AND IMPLEMENTATION The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article.
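The abstract does not spell out the statistic, so the following is only a rough Python sketch of one way a guided-PCA batch test could look. It assumes, as an illustration rather than the authors' stated definition, that the statistic is the variance of the first guided component (loadings taken from the SVD of Y'X, where Y is a batch indicator matrix) divided by the variance of the first ordinary principal component, and that significance is judged by permuting batch labels. The function names `gpca_delta` and `permutation_pvalue` are hypothetical and are not the API of the gPCA R package.

```python
import numpy as np


def first_component_variance(X, loading_source):
    """Variance of X projected on the first right singular vector of loading_source."""
    _, _, vt = np.linalg.svd(loading_source, full_matrices=False)
    return np.var(X @ vt[0], ddof=1)


def gpca_delta(X, batch):
    """Assumed statistic: guided first-component variance / unguided first-PC variance."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    X = X - X.mean(axis=0)  # center each feature (column)
    levels = np.unique(batch)
    # n x b indicator matrix: Y[i, k] = 1 if sample i belongs to batch k
    Y = (batch[:, None] == levels[None, :]).astype(float)
    guided = first_component_variance(X, Y.T @ X)  # loadings guided by batch
    unguided = first_component_variance(X, X)      # ordinary PCA loadings
    return guided / unguided


def permutation_pvalue(X, batch, n_perm=200, seed=0):
    """Permutation p-value: how often shuffled batch labels give a delta this large."""
    rng = np.random.default_rng(seed)
    batch = np.asarray(batch)
    observed = gpca_delta(X, batch)
    permuted = np.array(
        [gpca_delta(X, rng.permutation(batch)) for _ in range(n_perm)]
    )
    # add-one correction keeps the estimated p-value strictly positive
    p_value = (np.sum(permuted >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```

Under these assumptions the ratio lies between 0 and 1, because no direction can have more variance than the first principal component. A value near 1 would suggest that batch is aligned with the dominant source of variation, while a small permutation p-value would suggest a batch effect even when it is not the dominant one.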

Similar resources

svaseq: removing batch effects and other unwanted noise from sequencing data

It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analys...

Application of DNA Molecular Markers in Plant Breeding (Review article)

Plant Breeding has utilized a wide range of techniques and methods to improve the quality and quantity of plants. The molecular markers are the tools that have provided a new perspective for plant breeding advancements. This article has reviewed the various advantages and uses of molecular markers and the utilization of the high potential of natural polymorphisms within communities, combined wi...

Manifold Learning for Human Population Structure Studies

The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the "intrinsic dimensionality" of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE) which projects high dimensional genomic data into low dimensional, neighborhood preserving embedding, as a ...

The practical effect of batch on genomic prediction.

Measurements from microarrays and other high-throughput technologies are susceptible to non-biological artifacts like batch effects. It is known that batch effects can alter or obscure the set of significant results and biological conclusions in high-throughput studies. Here we examine the impact of batch effects on predictors built from genomic technologies. To investigate batch effects, we co...

Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.

BACKGROUND During the past 5 years, high-throughput technologies have been successfully used by epidemiology studies, but almost all have focused on sequence variation through genome-wide association studies (GWAS). Today, the study of other genomic events is becoming more common in large-scale epidemiological studies. Many of these, unlike the single-nucleotide polymorphism studied in GWAS, ar...

Journal:
  • Bioinformatics

Volume 29, Issue 22

Pages  -

Publication date: 2013